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A MIXTURE OF EXPERTS MODEL FOR RANK DATA WITH 
APPLICATIONS IN ELECTION STUDIES 

By Isobel Claire Gormley^ and Thomas Brendan Murphy^ 

University College Dublin 

A voting bloc is defined to be a group of voters wlio liave similar 
voting preferences. The cleavage of the Irish electorate into voting 
blocs is of interest. Irish elections employ a "single transferable vote" 
electoral system; under this system voters rank some or all of the 
electoral candidates in order of preference. These rank votes provide 
a rich source of preference information from which inferences about 
the composition of the electorate may be drawn. Additionally, the 
influence of social factors or covariates on the electorate composition 
is of interest. 

A mixture of experts model is a mixture model in which the model 
parameters are functions of covariates. A mixture of experts model 
for rank data is developed to provide a model-based method to cluster 
Irish voters into voting blocs, to examine the influence of social factors 
on this clustering and to examine the characteristic preferences of 
the voting blocs. The Benter model for rank data is employed as the 
family of component densities within the mixture of experts model; 
generalized linear model theory is employed to model the influence 
of covariates on the mixing proportions. Model fitting is achieved 
via a hybrid of the EM and MM algorithms. An example of the 
methodology is illustrated by examining an Irish presidential election. 
The existence of voting blocs in the electorate is established and it is 
determined that age and government satisfaction levels are important 
factors in influencing voting in this election. 

1. Introduction. The President of Ireland is elected every seven years by 
the Irish electorate through a preferential voting system known as the single 
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transferable vote (STV). Under this system voters rank some or all of the 
presidential candidates in order of preference. An intricate vote counting 
process involving the elimination of candidates and the transfer of votes 
results in the election of one candidate as President. 

A voting bloc is defined to be a group of voters who have similar voting 
preferences. The cleavage of any electorate into voting blocs is of interest to 
political scientists, politicians and voters. The cleavage of the Irish electorate 
is of particular interest, given the detailed, multi-preference votes expressed 
under the STV voting system. All the information expressed in the ranked 
preferences of the votes must be exploited in order to determine the true 
composition of the electorate. Further, the influence of social factors on the 
voting bloc membership of a voter is also of interest. 

This work aims to establish the presence of voting blocs within the 1997 
Irish presidential electorate, and to determine the characteristic voting pref- 
erences of these blocs. Additionally, the influence of social factors on the vot- 
ing bloc memberships of voters is explored. The ranked nature of the voting 
data is modeled using the Benter model for rank data [Benter (1994)] and 
a mixture model of these distributions provides a model-based approach to 
clustering voters into voting blocs [Gormley and Murphy (2008a)]. Voting 
bloc membership probabilities are treated as multinomial logistic functions 
of the social factors (or covariates) associated with a voter. Such a mixture 
model in which the membership probabilities are functions of covariates is a 
mixture of experts model [Jacobs et al. (1991)]. Thus, a mixture of experts 
model for rank data is developed, thereby extending the mixture model for 
rank data developed in [Gormley and Murphy (2006, 2008a)] by including 
covariate information to aid the characterization of the mixture components. 

Section 2 details the setting of the 1997 Irish presidential election and 
provides an example of the mechanics of the STV vote counting process. A 
mixture of experts model for rank data is formulated in Section 3 and unique 
model fitting aspects are discussed in Section 4. Model fitting is achieved via 
a hybrid of the popular EM algorithm [Dempster, Laird and Rubin (1977)] 
with the MM algorithm [Hunter and Lange (2004), Lange, Hunter and Yang 
(2000)] to produce an algorithm which we call the Expectation Minorization 
Maximization (EMM) algorithm. The mixture of experts model for rank 
data is fitted to the 1997 presidential electorate and the resulting model 
parameter estimates are discussed in Section 5. We compare the results in 
this analysis to other analyses of the data in Section 6. The article concludes 
with a discussion of the developed methodology. 

2. Irish presidential elections. Irish presidential elections employ the 
Single Transferable Vote (STV) system. Under this system voters rank some 
or all of the electoral candidates in order of preference. The votes are to- 
talled through a series of counts, where candidates are eliminated and their 
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votes are transferred between candidates. An in-depth description of the 
electoral system, including the method of counting votes, is given in Sinnott 
(1999) and good introductions to the Irish political system are given in 
Coakley and Gallagher (1999) and Sinnott (1995). Further, an illustrative 
example of the manner in which votes are counted and transferred follows in 
Section 2.2. We start with a description of data from the 1997 presidential 
election in Section 2.1. 

2.1. The 1997 presidential election. The current President of Ireland, 
Mary McAleese, is in her second term of office. Originally elected in 1997, 
she was automatically re-elected in 2004 as the only validly nominated can- 
didate. 

In the 1997 presidential election there were five candidates: Mary Ban- 
otti, Mary McAleese, Derek Nally, Adi Roche and Rosemary Scallon. As 
detailed in Table 2, some candidates were endorsed by political parties and 
others were independent candidates. Mary McAleese had a high public pro- 
file and received the backing of Fianna Fail, who were the main political 
party in the coalition government at the time. Mary Banotti was another 
high profile candidate who was endorsed by the main government opposi- 
tion party, Fine Gael. Adi Roche was supported by the Labour party, who 
were also a government opposition party. The remaining two candidates ran 
on independent tickets. In terms of campaign themes, Mary Banotti, Derek 
Nally and Adi Roche were considered to be liberal candidates, whereas Mary 
McAleese and Rosemary Scallon were deemed more conservative candidates. 
A detailed description of the entire presidential election campaign, including 
the nomination and selection of candidates, is given in Marsh (1999). 

An opinion poll conducted by Irish Marketing Surveys one month prior 
to the election is analyzed in this article. Interviews were conducted on 
1100 respondents, drawn from 100 sampling areas. Interviews took place at 
randomly located homes, with respondents selected according to a socioeco- 
nomic quota. A range of sociological questions was asked of each respondent, 
as was their voting preference, if any, for each of the candidates. These pref- 
erences were utilized as a statement of the intended voting preferences of 
each respondent. Of the respondents interviewed, 17 indicated that they 
did not intend to vote — these respondents were excluded from the analysis. 
Table 1 details the set of sociological covariates recorded in the poll. 

2.2. The vote counting process. A brief overview of the vote counting 
process is given here. For illustrative purposes, the transfer of votes in the 
1997 Irish presidential election is shown in Table 2. 

Under the STV electoral system, a "quota" of votes is calculated which 
is dependent on the number of seats available and the number of valid votes 
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cast. Specifically, the quota is computed as 

total valid votes cast ^ 
number of seats to be filled + 1 

Thus, for the 1997 presidential election, where 1,269,836 valid votes were 
cast and a single presidential seat was to be filled, the quota was calcu- 
lated to be 634,919. Once any candidate at any counting stage obtained or 
exceeded 634,919 votes, this candidate was deemed elected as President of 
Ireland. As detailed in Table 2, in the first stage of the counting process the 
number of first preference votes obtained by each candidate is totaled. No 
candidate received enough first preference votes to exceed the quota. Mary 

Table 1 

The set of covariates recorded in the presidential election opinion poll and the associated 
levels (in the case of categorical variables). An explanation of the socioeconomic group 
codes are provided in Appendix B 



Age Area 


Gender Government 


Marital 


Socioeconomic 




satisfaction 


status 


group 


— City 


Housewife Satisfied 


Married 


AB 


Town 


Nonhousewife Dissatisfied 


Single 


CI 


Rural 


Male No opinion 


Widowed 


C2 








DE 








F50+ 








F50- 




Table 2 






The transfer of votes in the 1997 presidential election. 


The quota required to be elected 


President of Ireland was 634,919. Mary McAleese (denoted in bold font) was elected 


Candidate 


Endorsing party 


Count 1 


Count 2 


Mary Banotti 


Fine Gael 


372,002 


+ 125,514 








497,516 


Mary McAleese 


Fianna Fail 


574,424 


+131,835 








706,259 


Derek Nally 


Independent 


59,529 


-59,529 








Eliminated 


Adi Roche 


Labour 


88,423 


-88,423 








Eliminated 


Rosemary Scallon 


Independent 


175,458 


-175,458 








Eliminated 


Nontransferable votes 




+ 66,061 








66,061 


Total valid votes 




1,269,836 


1,269,836 
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McAleese received the largest number of first preference votes with 45% of 
the vote share. Candidates Nally, Roche and Scallon were eliminated from 
the election race after the first count, as the sum of their votes was less than 
the votes of the next lowest candidate (Mary Banotti). 

At the second stage of counting, Nally, Roche and Scallon's 323,410 first 
preference votes were transferred to the candidates given the next valid 
preference on those ballot papers. Of votes to be transferred, 66,061 were 
nontransferable because only a single preference was expressed on these bal- 
lots or lower preferences on the ballots were for eliminated candidates. Mary 
McAleese received 131,835 of the transferred votes and was therefore elected 
at the second counting stage, as she exceeded the quota with 706,259 votes. 

3. A mixture of experts model for rank data. The mixture of experts 
(MoE) model [Jacobs et al. (1991), Jordan and Jacobs (1994)] combines the 
ideas of mixture models [McLachlan and Peel (2000)] and generalized linear 
models [McCullagh and Nelder (1983), Dobson (2002)]. A mixture model is 
used to model the heterogeneous nature of a population; generalized linear 
model theory provides the statistical structure within the mixture. 

MoE models account for the relationship between a set of response and 
covariate variables where it is assumed that the conditional distribution 
of the response given the covariates is a finite mixture distribution. The 
conditional probability of voter i's ballot Xj, given their associated covariates 
Wi, is 

K 
k=l 

where K denotes the number of components (or expert networks) in the 
mixture, the gating network coefficient = T^kiuii) is the probability of 
voter i being a complete member of expert network k and 9^ represents the 
parameters of the probability model of the kth expert network. In the current 
context, an expert network corresponds to a voting bloc in the electorate. 
A more general mixture of experts model would allow the expert network 
parameters to depend on the covariates Wj, however, such a model would be 
difficult to interpret in terms of voting blocs. 

The gating network coefficients are weighting probabilities constrained 
such that they are nonnegative and sum to one for each voter. The prob- 
ability of voter i's ballot according to the expert networks in the mixture 
model are blended by the gating network coefficients to produce an overall 
probability. Thus, the probability of voter i's ballot is a convex combination 
of the output probabilities from the expert networks. Figure 1 provides a 
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Expert Nt'twoi'k 2. 



Fig. 1. Graphical illustration of the structure of a single layer mixture of experts model 
with two expert networks. The probabilities of voter i 's ballot according to the expert net- 
works F{x^\9i) and P{x^\92) are blended by the gating network coefficients nn and ni2 
to produce an overall probability of voter i 's ballot. The gating network coefficients are 
assumed to be a function of voter i 's covariates w_i ■ 



graphical illustration of the structure of a single layer MoE model — a hierar- 
chical MoE model consists of multiple layers of expert networks and gating 
networks. 

Traditional MoE models, such as those fitted in Jordan and Jacobs (1994), 
Jacobs et al. (1991) and Peng, Jacobs and Tanner (1996), employ probabil- 
ity densities for the expert networks which are members of the exponential 
family, that is, the traditional MoE model has the form of a mixture of 
generalized linear models. In the context of STV voting data, however, the 
expert network probability densities must appropriately model the ranked 
nature of the data. Thus, the Benter model for rank data [Benter (1994)] is 
employed; full details are provided in Section 3.1. 

As illustrated in Figure 1, the gating network coefficients are assumed to 
be functions of the voter covariates. The intuition here is that the covariates 
of a voter determine their voting bloc membership and, in turn, their char- 
acteristic voting preferences. Specifically, the gating network coefficients are 
assumed to be multinomial logistic functions of the voter covariates; details 
are provided in Section 3.2. 

The tree-like structure of MoE models naturally induces comparisons to 
other tree-based classification methods, such as Classification and Regression 
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Trees (CART) [Breiman et al. (2006)] or Multivariate Adaptive Regression 
Splines (MARS) [Friedman (1991)]. Both CART and MARS are nonpara- 
metric techniques which provide a hard partition of the data space — using 
these tools each voter would be classified as belonging to one and only one 
voting bloc. In contrast, the statistical models underlying the expert network 
probability densities mean MoE models are parametric in nature. Addition- 
ally, the MoE model provides a probabilistic "soft" partition of the space 
in that data points may belong to multiple expert networks, that is, un- 
der the MoE model each voter has an associated probability of belonging 
to each voting bloc. Further comparison of these methods is provided in 
Peng, Jacobs and Tanner (1996) and Bishop (2006). 

3.1. Benter's model for rank data. In previous versions of the MoE model 
[Jacobs et al. (1991), Jordan and Jacobs (1994), Peng, Jacobs and Tanner 
(1996)] it is assumed that each component of the mixture model (i.e., each 
expert network) produces its output as a generalized linear function of input 
predictor variables. Within the context of STV voting data, each expert 
network must appropriately model the ranked nature of the data. Thus, it 
is assumed that each expert network is a Benter model distribution for rank 
data [Benter (1994)]. Each expert network is characterized by a differently 
parameterized Benter model where the parametrization is constant with 
respect to the voter covariates. It would be possible to allow covariates 
to contribute to the expert networks [see Jordan and Jacobs (1994) and 
Peng, Jacobs and Tanner (1996)], but this is not examined here due to the 
fact that interpreting the expert networks in terms of voting blocs would be 
difficult. 

The Benter model for rank data has two parameters — a support parameter 
and a dampening parameter: 

(i) Support parameter. Within expert network k, the support parameter 
vector is denoted = {pki, ■ ■ ■ ,PkN), where < pkj < 1, J2jLiPkj = 1 and 

denotes the number of candidates available for selection. The support 
parameter pi^j may be interpreted as the probability of candidate j being 
given a first preference by a complete member of voting bloc k. 

(ii) Dampening parameter. The global dampening parameter vector is de- 
noted by Q = (ai, . . . , qat), where at G [0, 1] for t = 1, . . . , A^. To avoid over- 
parametrization of the model, the constraints qi = 1 and qat = are im- 
posed. The dampening parameters model the way in which some preferences 
may be chosen less carefully than other preferences within a ballot. 

Let c{i,t) denote the candidate ranked in tth. position by voter i and Ui 
be the total number of preferences expressed by voter i. Given the Benter 
model parameters, the probability of voter i's ballot (conditional on voter i 
being a complete member of voting bloc k) is 
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Pfe|^fc) = Pfeb,'«)=Kc(.,i)- 



Pkc{i,2) Pkc(i,ni) 



(3.1) 



n 



Pkc{i,t) 



t=l 




Thus, the Benter model states that the probability of a rank ballot is the 
product of the probabilities of each chosen candidate being ranked first 
where, at each preference level, the probabilities are appropriately normal- 
ized to account for the fact that the cardinality of the choice set has been 
reduced. Moreover, at preference level t, the care with which a preference is 
made is modeled by "dampening" each probability by at- 

Under the Benter model, the log odds of selecting candidate a over candi- 
date b at preference level t is atlog{pka/Pkb)- Thus, the tth level dampening 
parameter at can be interpreted as how the log odds of selecting candidate 
a over candidate b is affected by the selection being made at preference level 
t. 

The Benter model has been successfully employed to model rank data 
[see Gormley and Murphy (2006, 2008a)], but alternative rank data models 
are also available; the Plackett-Luce model for rank data [Plackett (1975)] 
is a special case of the Benter model in which the dampening parameter 
vector is constrained such that a = l. Under the Plackett-Luce model, it is 
assumed that a voter makes their choice at each preference level with equal 
certainty. Mixtures of Plackett-Luce models have been fitted to rank data in 
Gormley and Murphy (2006) and the fitting algorithms for these models are 
more efficient due to the fixed a value. The Benter and Plackett-Luce models 
are both multistage ranking models [Marden (1995)]; in Fligner and Verducci 
(1986) this large class of models are defined as those which decompose the 
ranking process into a series of independent stages. Such models have an 
"item-effect" approach [Fligner and Verducci (1986)] in that the probability 
of the preference of one item over another is the element of interest. Another 
set of rank data models are "distance" based such as those based on Mal- 
low's model [Mallows (1957)]; in such models the probability of observing 
a ranking x decreases as the distance between x and the modal ranking y 
increases. Other distance based approaches are detailed in Gordon (1979) 
and Fligner and Verducci (1986). Cluster analysis via mixtures of distance 
based models is described in Murphy and Martin (2003) and Busse et al. 
(2007). Given the type of choice process undertaken by a voter when gener- 
ating an STV ballot paper, the Benter model for rank data was deemed the 
most applicable in this context. 
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3.2. Gating network coefficients and generalized linear models. The gat- 
ing network coefficients in tlie MoE model can be viewed as tlie success 
probabilities from a generalized linear model. In particular, the success prob- 
ability of belonging to each of K expert networks is a multinomial logistic 
function of the covariates (see Figure 1). Voter i's gating network coeffi- 
cients vTj = (vTji, vri2, . . . , vri/<) are modeled by a logistic function of their L 
covariates = {wn, 11)12, • • • > wn), that is, 



where expert network 1 is used as the baseline expert network and (3ko is 
an intercept term. Similar methodology was employed in Jordan and Jacobs 
(1994) and Peng, Jacobs and Tanner (1996) when modeling the gating net- 
work coefficients. 

4. Fitting the MoE model via the EMM algorithm. To determine the 
composition and voting characteristics of the Irish electorate, estimates of 
the Benter model parameters and of the gating network coefficients are re- 
quired. Model fitting of the MoE model is achieved in Jacobs et al. (1991) 
and Jordan and Jacobs (1994) via the Expectation Maximization (EM) al- 
gorithm. Estimation of the MoE model within the Bayesian framework is 
detailed in Peng, Jacobs and Tanner (1996), in which Markov chain Monte 
Carlo methods [Tanner (1996)] are used. An alternative approach to esti- 
mation of the MoE parameters within the Bayesian framework through the 
use of variational methods is detailed in Bishop and Svensen (2003). 

In this article parameter estimation is achieved through a hybrid algo- 
rithm known as the "EMM" algorithm. As the name implies, the EMM algo- 
rithm combines the well-known EM algorithm [Dempster, Laird and Rubin 
(1977)] with ideas from the MM algorithm [Lange, Hunter and Yang (2000)]. 

4.1. The EM algorithm for the MoE model. The EM algorithm is most 
commonly known as a technique to produce maximum likelihood estimates 
(MLEs) of model parameters in settings where the data under study is in- 
complete or when optimization of the likelihood would be simplified if an 
additional set of variables were known. The iterative EM algorithm consists 
of an expectation (E) step followed by a maximization (M) step. Generally, 
during the E step the expected value of the log likelihood of the complete 
data (i.e., the observed and unobserved data) is computed. In the M step the 
expected log likelihood is maximized with respect to the model parameters. 
In practice, the imputation of latent variables often makes maximization of 
the expected log likelihood feasible. The parameter estimates produced in 
the M step are then used in a new E step and the cycle continues until con- 
vergence. The parameter estimates produced on convergence are estimates 
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which achieve at least a local maximum of the likelihood function of the 
data. 

It is difficult to directly obtain MLEs from the likelihood of the MoE 
model for M rank observations: 

M K 

£(/3,p,a|x,w) =p(x|w,/3,p,a) = \Y^T^ik{w^'P{xi\pj^,a). 

i=l k=l 

To alleviate this problem, the data is augmented by imputing latent vari- 
ables. For each voter i = 1, . . . , M, the latent variable = {zn, . . . , Zix) is 
imputed where Zik takes the value 1 if voter i is a complete member of ex- 
pert network k and the value otherwise. This provides the complete data 
likelihood 

M K 

-Cc(/3,P,a|x,z,w) =p(x,z|w,/3,p,a) = J| '[[{Trik{wi)'P{xi\p^,a)Y"' , 

i=lk=l 

the expectation of (the log of) which is obtained in the E step of the EM 
algorithm. Details are provided in Appendix C, but, in brief, the E step 
consists of replacing the missing data z with their expected values z. In the 
M step the complete data log likelihood, computed with the estimates z, is 
maximized to provide estimates of the Benter parameters p and a and the 
gating network parameters /5. 

The EM algorithm for fitting the MoE model for rank data is straight- 
forward in principle, but the M step is difficult in practice. This is largely 
due to the complex form of the Benter model density (3.1) and the large 
parameter set. A modified version of the EM algorithm, the Expectation 
and Conditional Maximization (ECM) algorithm [Meng and Rubin (1993)], 
is therefore employed. In the ECM algorithm, the M step consists of a series 
of conditional maximization steps. Again, in the context of the MoE model 
for rank data, these maximizations are not straight forward and, thus, the 
conditional M step is implemented using the MM algorithm. 

4.2. The MM algorithm. The MM algorithm is a summary term for opti- 
mization algorithms which operate by transferring optimization from the ob- 
jective function of interest to a more tractable surrogate function. Good sum- 
maries of the methodology are provided in Lange, Hunter and Yang (2000), 
Hunter and Lange (2004) and Hunter (2004). The initials MM depend on 
the type of optimization required. In a maximization problem MM stands for 
minorize and maximize; in a minimization problem, majorize and minimize. 
A minorizing (or majorizing) surrogate function is constructed by exploiting 
mathematical properties of the objective function or of terms within it. The 
MM philosophy is that iteratively optimizing a suitable surrogate function 
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Fig. 2. A graphical illustration of one iterative step in a maximization MM algorithm. 
A minorizing surrogate function g{6\9") (in red) is first fitted to the objective function 
f{6) (in black) at the parameter value 6". Maximizing this minorizing surrogate function 
provides a new parameter estimate 6"'^^ . A new minorizing surrogate function is fitted 
(in blue) to the objective function at 9"^^^. The process continues, driving the objective 
function uphill, until the parameter estimates converge, indicating that at least a local 
maximum of the objective function has been reached. 

drives the objective function uphill or downhill as required. Iteratively max- 
imizing a minorizing surrogate function produces a sequence of parameter 
estimates which converges to at least a local maximum of the objective func- 
tion. A graphical illustration of the mechanics of the MM algorithm is given 
in Figure 2. Details of the stability of MM algorithms and their relation 
to the EM algorithm (the EM algorithm is in fact an MM algorithm) are 
detailed in Lange, Hunter and Yang (2000) and Hunter and Lange (2004). 

In the context of fitting MoE models for rank data, the optimization 
problems in the conditional M step of the EM algorithm are overcome by 
embedding several iterations of the MM algorithm in place of the conditional 
M step. Details of the construction of the necessary surrogate functions are 
provided in Appendix C. 

5. The MoE model for rank data and the Irish electorate. The MoE 

model for rank data was fitted to the set of voters polled in the Irish Mar- 
keting Surveys opinion poll detailed in Section 2.1. For reasons of numerical 
stability and ease of interpretation, covariates were initially standardized 
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such that < wu < 1, where wu denotes the value of the Ith covariate for 
voter i. A single layer MoE model rather than a hierarchical model was as- 
sumed to be sufficient in this context due to the small number of candidates 
in the presidential race. 

Within a single layer MoE model, the number K of expert networks 
(or voting blocs) present in the electorate needs to be estimated. In 
Jordan and Jacobs (1994) K is chosen to be the value which minimizes a test 
set error rate; the variational Bayes approach taken in Bishop and Svensen 
(2003) provides a framework in which both the number of expert networks 
in and the topology of the MoE model may be estimated. The Bayesian 
Information Criterion (BIC) [Schwartz (1978)] is utilized here to select the 
optimal number of experts. The BIC is an information criterion motivated 
by the aim of minimizing the Kullback-Leibler information of the true model 
from the fitted model. The usual justification for the BIC is that, for regular 
problems, it is an approximation of the Bayes factor for comparing mod- 
els under certain prior assumptions [Kass and Raftery (1995)]. The BIC is 
defined as 

(5.1) BIC = 2(maximized likelihood) — (number of parameters) log(M), 

where M is the total number of data points. The BIC trades off model fit 
[assessed by the first term in (5.1)] against model complexity [assessed by 
the second term in (5.1)]. The use of BIC for model selection is not com- 
pletely accepted; Gelman and Rubin (1995) and Raftery (1995) provide two 
contrasting views. Although mixture models do not satisfy the conditions 
necessary for the Bayes factor approximation to hold, there is much in the 
literature to support its use in this context [see, e.g., Leroux (1992), Keribin 
(1998), Keribin (2000) and Fraley and Raftery (1998)]. Within the context 
of this application, BIC gives reasonable results in terms of the voting blocs 
found. 

As with any iterative procedure, starting values may be influential on 
the output of the algorithm. Starting values for the Benter support param- 
eters, dampening parameters and missing membership labels were obtained 
by initially running the EMM algorithm for 500 iterations for a straight for- 
ward mixture of Benter models (i.e., the gating network coefficients are not 
treated as functions of the voter covariates). Good starting values for the 
gating network parameters were then obtained by running 1000 of the logis- 
tic regression style M steps [see (C.5)] of the EMM algorithm. The full EMM 
algorithm to provide MLEs of the model parameters was then iterated un- 
til convergence as deemed by Aitken's acceleration criterion [Bohning et al. 
(1994)]. Subsequent to convergence, approximate standard errors of the 
MLEs were calculated as detailed in McLachlan and Krishnan (1997) and 
McLachlan and Peel (2000). 
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Table 3 

The five best fitting MoE models as deemed by the BIC. Larger BIC values indicate better 
fitting models. The number of expert networks K and the associated covariates of the 

models are also reported 



BIC 


K 


Covariates 


-8490.43 


4 


Age 

Government satisfaction 


-8498.59 


3 


Age 


-8507.33 


3 


Age 

Government satisfaction 


-8511.37 


3 


Government satisfaction 


-8512.62 


5 


Age 

Government satisfaction 



The MoE model was fitted over the range K = 1,2, ... ,5 expert networks 
using a backward elimination style method to choose the informative co- 
variates. Interaction terms were avoided. A model with all six covariates 
was initially fitted, then models with only five of the covariates. From this 
set of models the "best" model as deemed by the BIC was selected and 
models with only four of the selected covariates were then fitted. This se- 
lection of the best subset of covariates and then backward elimination was 
continued until only one covariate was left in the model. The BIC values for 
all the fitted models were then compared. Table 3 details the five best fitting 
models as deemed by their BIC values. The covariates within each selected 
model are also detailed. 

Each of the five best fitting models considered age and/or government 
satisfaction as important covariates. The optimal model with K = 4: expert 
networks where age and government satisfaction are the influential covariates 
is discussed below. 

5.1. Benter support parameter estimates. Figure 3 is a mosaic plot il- 
lustrating the Benter support parameter estimates within each of the four 
voting blocs in the optimal model. The voting blocs are each represented 
by a column and their associated marginal membership probabilities are 
reported. 

Voting bloc 1 appears to favor the conservative candidates of McAleese 
and Scallon. The opinion poll was conducted at an early stage of the electoral 
campaign and Scallon had not yet established herself as a main presidential 
contender. Thus, the 31% support for Scallon in this voting bloc is the largest 
support she obtains. Voting bloc 2 also reveals characteristics of the early 
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Voting bloc 4 



Banotti 



McAleese 



Nally 



Roche 



Seal Ion 




7T1-0.19 



7t2-0.16 



7C3-0.35 



7t4-0.3 



Fig. 3. A graphical representation of the maximum likelihood estimates of the Benter 
support parameters for the Irish Marketing Surveys opinion poll. Each column of the mo- 
saic represents an expert network or voting bloc — the segments within the columns repre- 
sent the magnitudes of the support parameters for the candidates within each voting bloc. 
The maximum likelihood estimates of the support parameters are detailed within each seg- 
ment; standard errors are provided in parentheses. The width of each column represents 
the marginal probability tt^ of belonging to each voting bloc k. 



stages of the presidential campaign. Adi Roche has large support in this vot- 
ing bloc — at the start of the campaign Roche was a very popular candidate, 
but her support quickly dropped when she became embroiled in difficulties 
and her campaign went into decline. Voting bloc 3, the largest voting bloc in 
terms of marginal membership probabilities, has a large support parameter 
for Mary McAleese. McAleese, who was subsequently elected, was backed 
by the current governmental political party, Fianna Fail, and thus, she had 
a high public profile. There is also some level of support for the other high 
profile candidate, Mary Banotti. Voting bloc 4 is a pro-Banotti voting bloc 
with more uniform levels of support for the other candidates. Of note are 
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the low levels of support for Nally in all of the voting blocs — Nally joined 
the electoral campaign later than the other candidates on September 29th 
and so had little time to win support prior to this October 2nd poll. 

5.2. Benter dampening parameter estimates. Under the optimal model, 
the Benter dampening parameter estimates are 

a = (1.00, 0.99(0.10), 0.97(0.12), 0.99(0.15), 1-00) 

(standard errors are given in parentheses). The estimates suggest that the 
certainty with which voters rank their preferences remains constant with 
respect to choice level. The proximity of the dampening parameter estimates 
to 1, along with their relatively large standard errors, suggest a Plackett- 
Luce model (Section 3.1) would be adequate for modeling this poll data. 

The Benter dampening parameters appear to depend somewhat on the 
cardinality of the choice set; in this case, where the choice set is small, 
a ~ 1. In Gormley and Murphy (2008a) the Benter model is employed when 
modeling a larger choice set and a is shown to differ from 1.. Intuitively, 
the certainty associated with the ranking of objects from a small choice set 
would be greater than that associated with the ranking of objects from a 
large choice set. 

5.3. Gating network parameter estimates. Under the MoE model for 
rank data, the gating network coefficients are functions of voter covariates. 
According to the BIG (see Table 3), the "age" and "government satisfac- 
tion" covariates influence the voting bloc membership probabilities of a 
voter. Table 4 details the associated gating network parameter estimates, 
their odds ratios and the relevant 95% confidence intervals for the odds 
ratios. The gating network parameters associated with the "conservative" 
voting bloc (i.e., voting bloc 1) are used as the reference parameters, that 
is, ^^_ = (/3io,...,/3il) = (0,...,0). 

Within the smallest voting bloc (i.e., voting bloc 2 or the pro- Roche bloc), 
for every one unit increase in age the odds for being best described by voting 
bloc 2 are 100 times less than the odds for being described by the conserva- 
tive voting bloc 1. This would appear to be an intuitive characteristic of the 
Irish electorate — the more elderly generations in Ireland would, in general, 
be considered more conservatively minded. Note also the relatively small 
associated odds ratio confidence interval. The 95% confidence intervals for 
the government satisfaction covariate odds ratios both enclose 1, implying 
it is likely that the political views of voters in this bloc have little influence. 
Thus, younger voters appear to be best described by voting bloc 2 and are 
more in favor of the liberal Adi Roche. 

In terms of the gating network parameters which refer to voting bloc 3 
(the pro-McAleese bloc), the confidence interval for the age parameter odds 
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Table 4 

Gating network parameter estimates 0^, the associated odds ratios and the 95% odds 
ratio confidence intervals under the MoE model for rank data fitted to the Irish 
Marketing Surveys opinion poll data. The covariates selected as informative are age and 

government satisfaction. 
"Do not know/no opinion" was used as the reference level within the categorical 
government satisfaction covariate 







Intercept 


Age 


Satisfied 


Not satisfied 


Voting 
bloc 
2 


Log odds 0^) 
Odds ratio [exp(/3^)] 
95% CI (Odds ratio) 


0.92 
2.52 
[0.78, 8.16] 


-5.16 
0.01 
[0.00,0.05] 


0.13 
1.14 
[0.42, 3.11] 


1.03 
2.80 
[0.77, 10.15] 


Voting 
bloc 
3 


Log odds {$J 
Odds ratio [exp(/3^)] 
95% CI (Odds ratio) 


-0.46 
0.63 
[0.16, 2.49] 


-0.05 
0.95 
[0.32,2.81] 


1.14 
3.12 
[0.94, 10.31] 


1.33 
3.81 
[0.90, 16.13] 


Voting 
bloc 
4 


Log odds iP^) 
Odds ratio [exp(/3^)] 
95% CI (Odds ratio) 


0.54 
1.71 

[0.52, 5.58] 


0.44 
1.56 
[0.35, 6.91] 


-1.05 
0.35 
[0.12, 0.98] 


1.25 
3.50 
[1.07, 11.43] 



ratio includes 1, suggesting age is not a driving covariate. The government 
satisfaction covariate appears to be more influential: the odds of a voter 
being best described by voting bloc 3 are around 3 times greater than the 
odds for voting bloc 1, given that the voter has some political opinion. Thus, 
voters with an interest in politics appear to favor IVIary IMcAleese. 

The gating parameters for voting bloc 4 indicate that voters with a dislike 
for the current government favor IVIary Banotti. The confidence interval for 
the age covariate again includes 1, suggesting it has little effect. The odds of 
a voter who indicated a dislike for the 1997 government (a coalition govern- 
ment of Fianna Fail and the Progressive Democrats) being best described 
by voting bloc 4 were 3.50 times greater than being described by voting 
bloc 1. In contrast, the odds of a voter in favor of the current government 
being best described by voting bloc 4 are 0.35 times greater than the odds 
for voting bloc 1. These results make intuitive sense within the context of 
the 1997 presidential election. IVIary Banotti was endorsed by Fine Gael, 
the main opposition party to Fianna Fail. Thus, voters best described by 
voting bloc 4 appear to be Fine Gael supporters. Those voters in favor of 
the 1997 coalition government were more likely to be described by voting 
bloc 1, which had large levels of support for Fianna Fail backed IVlcAleese. 

6. Comparison. 

6.1. Results. The analysis completed here is an extension of previous 
work exploring voting blocs in Irish elections [Gormley and Murphy (2008a)]. 



A MIXTURE OF EXPERTS MODEL FOR RANK DATA 



17 



The mixture of experts model provides an extension of the mixture model 
because it allows us to assess which social factors influence voting bloc mem- 
bership. The mixture of experts analysis suggests that "age" and "govern- 
ment satisfaction" influence voting bloc membership. 

The analysis presented in Gormley and Murphy (2008a), for the same 
opinion poll presented here, finds just two voting blocs in the electorate. A 
mosaic plot illustrating the support parameters in the two voting blocs is 
given in Figure 4. The first voting bloc is a "noise" component where each 
candidate has equal support; such a component can collect small voting blocs 
and voters with unusual preference patterns. In this case, voting bloc 1 and, 
to an extent, voting bloc 4 from the mixture of experts model (Figure 3) are 
being combined in the noise group. The second voting bloc in Figure 4 is very 
similar to voting bloc 3 in the mixture of experts analysis, but also contains 
some of the voters from voting blocs 2 and 4. It is not surprising that more 
voting blocs are found using the mixture of experts model, because voting 
bloc membership is estimated from covariates recording social factors and 
voting behavior, whereas in the mixture model we only have voting data. 
Hence, the mixture of experts model is able to exploit the more detailed 
structure within the electorate than the standard mixture model can. 

A comparison of the voting blocs found in the analysis of an opinion 
poll one month prior to the election (as presented here) with the voting 
blocs found for the exit poll [as shown in Figure 2 in Gormley and Murphy 
(2008a)] shows very different results. Noticeably, voting bloc 2 in the mixture 
of experts analysis is not present in the exit poll analysis — this is because 
support for Roche collapsed in the intervening month. However, both polls 
show voting blocs that have strong support for McAleese and Banotti re- 
spectively. 

A latent space model was used in Gormley (2006) and Gormley and Murphy 
(2007) to model rank data and, in particular, Irish election data. They 
showed that the 2002 Irish general election and the 1997 presidential elec- 
tions can be modeled using a one or two-dimensional latent space. In par- 
ticular, the opinion poll considered in this paper was analyzed in Gormley 
(2006) and it was found that the election could be modeled using a one or 
two-dimensional latent space. The low dimensionality of the latent space 
found in Gormley (2006) is mirrored in the small number of voting blocs 
found in the mixture model in Gormley and Murphy (2008a) and the mix- 
ture of experts model here. 

An analysis of the opinion poll data from the presidential election pre- 
sented in van der Brug, van der Eijk and Marsh (2000) showed that a large 
number of voters' first preference candidate coincided with their first pref- 
erence national party, so the election had a partisan aspect to it. However, 
they suggest that the election was not strongly partisan. Our analysis gives 
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Banotti 
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Nally 



0.2 
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Fig. 4. A graphical representation of the maximum likelihood estimates of the support 
parameters in the two component Plackett-Luce mixture model with a noise component. 
Each column of the mosaic represents a mixture component or voting bloc — the segments 
within the columns represent the magnitudes of the support parameters for the candidates 
within the voting bloc. The width of each column represents the mixture component prob- 
ability TTfe of belonging to each voting bloc. Standard errors of the estimates are given in 
parentheses. 



similar results in suggesting that support for the government was an impor- 
tant aspect in the membership of voting blocs but that voter age was also 
influential. 

6.2. Methods. The methods presented in this work are closely related to 
other statistical methods proposed for the analysis of social science data. Of 
particular relevance are the following methods. 

A mixture model was used in Hill (2001) for the analysis of opinion- 
changing behavior so that different types of opinion behaviors could be ac- 
commodated. 
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An alternative extension to the mixture model is the mixed membership 
model for multivariate binary data [Pritchard, Stephens and Peter (2000), 
Erosheva, Fienberg and Joutard (2007)] and network data [Airoldi et al. 
(2008)]; such models could be used to study voting blocs in elections that use 
an approval voting system or votes within a parliamentary system. A mixed 
membership model for rank data has been developed in Gormley (2006) 
and Gormley and Murphy (2008) to accommodate mixed membership of 
voting blocs in STV voting data. The mixed membership model finds more 
extreme voting blocs than the mixture model because voters are allowed to 
have partial membership of more than one bloc. 

The existence and characterization of voting blocs in the US Senate has 
been considered in Jakulin and Buntine (2004). Their analysis using ex- 
ploratory and model-based techniques reveals three Republican and two 
Democrat voting blocs in the US Senate. An analysis of Asian voting be- 
havior in Tam (1995) showed that Asian voters should not be treated as a 
monolithic voting bloc. Multidimensional scaling methods were utilized in 
Holloway (1990) to study voting blocs and their development over time in 
the United Nations General Assembly. 

7. Discussion and further work. This article develops a mixture of ex- 
perts model for rank data coupled with an efficient hybrid EMM algorithm 
for model fitting. The model is employed as a model-based clustering tech- 
nique in which covariate information contributes to the clustering solution. 
The covariate information contributes to the clustering solution by modeling 
the component membership parameters of an observation as a generalized 
linear function of observation covariates. Within the context of rank data, 
each component in the population is characterized by an appropriate rank 
data model, the Benter model for rank data. 

The MoE model for rank data has been used to establish the presence 
of four voting blocs in the electorate one month prior to the 1997 Irish 
Presidential Election. The voting blocs show that the electorate is divided 
on an ideological and partisan basis with each of the prominant candidates 
having a bloc of support. We found that age and government satisfaction 
levels were important social factors in determining membership of the voting 
blocs. 

The dampening parameter in the Benter model indicates that voters were 
selecting all of their preferences with great care. In fact, the fitted Benter 
model is very close to the Plackett-Luce model because the dampening pa- 
rameter is almost equal to one for each choice level. This phenomenon can be 
explained by the fact that the election only had five candidates. It has pre- 
viously been shown in Gormley and Murphy (2008a) that in elections with 
a greater number of candidates the voters select lower preference candidates 
with less certainty. 
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The MoE model was able to find a more detailed voting bloc structure 
than a mixture model analysis of the data. This was because the covariates 
and the voting behavior both contribute to the structure of the voting blocs, 
whereas in a mixture model only voting behavior does. 

Although the MoE model for rank data proved an appropriate model for 
Irish voting data, there is still much scope for future research. As stated, 
a single layer MoE model rather than a hierarchical model was assumed to 
be sufficient in this context due to the small number of candidates in the 
presidential race. Irish governmental elections typically involve a large num- 
ber of candidates and in such a context a single layer model is unlikely to 
be sufficient. Within the methodology developed in this article, there is no 
natural metric for selecting the complexity and structure (i.e., the topology) 
of a hierarchical MoE tree. The use of variational Bayesian methods to esti- 
mate a hierarchical MoE model [Bishop and Svensen (2003)] allows both the 
number of experts and the topology of the associated tree to be determined 
within a statistically sound framework. The extension to the estimation of 
a hierarchical MoE model for rank data is a future area of research. 

If a multi-layer MoE model is appropriate, the underlying intuition is 
that the choice process of a voter is a nested process. For example, perhaps 
a conservatively minded voter chooses a set of conservative candidates first 
and then from that set chooses a particular candidate. Nested choice models 
[McFadden (1978), Train (2003)] could be used to model such a choice proce- 
dure. These models assume that choices are made in a hierarchical manner; 
the voters begin with coarse categories which are refined during the choice 
process. Nested choice models could be extended to nested ranking models 
using a multi-stage ranking model approach. 

The scope of MoE models for rank data lies beyond modeling Irish election 
data. Many other nations employ preference based voting systems 
[Gormley and Murphy (2008a)] and the model proposed here can be easily 
adapted to model such electorates. A number of scholarly societies, including 
the Institute of Mathematical Statistics and the Royal Statistical Society, 
use STV voting in their elections and the methods proposed here could be 
applied to the analysis of their elections. The methodology presented may 
also be utilized to model other preference data. Irish third level college appli- 
cation choices are analyzed in Gormley and Murphy (2006) using a mixture 
of Plackett-Luce models and establish the existence of homogeneous groups 
of applicants. The extension of this research to examine the influence of 
applicant covariates on third level course choices is a topic of social and ed- 
ucational interest. Additionally, the proposed methodology could be applied 
to the analysis of customer choice data in marketing applications, where 
customers express preferences for different products. 
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Table 5 



Code Socioeconomic definition 

AB upper middle class & middle class 

CI lower middle class 

C2 skilled working class 

DE other working class & lowest level of subsistence 

F50+ large farmers 

F50- small farmers 



APPENDIX A: DATA SOURCES 

The 1997 Irish presidential opinion poll data set was collected by Irish 
Marketing Surveys and is available through the Irish Elections Data Archive 
http : //www. ted. ie/Political_Science/elections/elections .html, 
which is maintained by Professor Michael Marsh in the Department of Po- 
litical Science, Trinity College Dublin, Ireland. 

APPENDIX B: SOCIOECONOMIC GROUP CODES 

Definitions of the socioeconomic group codes used in the opinion poll con- 
ducted by Irish Marketing Surveys are provided in Table 5. Further details 
may be obtained from Millward Brown/Irish Marketing Surveys Limited, 
www . millwar dbrown .com. 

APPENDIX C: THE EMM ALGORITHM FOR THE MIXTURE OF 

EXPERTS MODEL FOR RANK DATA 

Supplementary material [Gormley and Murphy (2008c)] provides a pro- 
gram in C code which may be used to implement the EMM algorithm for 
the mixtures of experts model for rank data. 

When fitting a MoE model for rank data, the EMM algorithm consists of 
the following steps: 

0. Let h = and choose initial parameter estimates for the Benter model 
parameters p^'^^ a^^^ and for the gating network parameters (3^'^\ 

1. E step: Compute the estimates 

Zik = ttn for 1 = 1,. . . ,M and fe = 1, . . . , A . 

Note that by (3.2) the gating network coefficients are defined by 



E^=iexp(/3l',m 
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2. M step: Substituting the Zik values obtained in the E step into the com- 
plete data log likelihood forms the "Q function" 



(C.l) 



M K 
i=l k=l 



N 



t=l 



at 
c{i,s) 



which is maximized with respect to the model parameters during the M 
step. The dependence of the parameters in Q on estimates from the hth 
iteration of the algorithm is implicit; the notation is suppressed here for 
reasons of clarity. Due to maximization difficulties, steps from the MM 
algorithm are embedded in the M step to obtain MLEs of the parameters. 
Details of the MM algorithm steps are detailed in Appendix C.l. The new 
maximizing values are p^'^"''^^ a^^~^^^ and P^^'^^K 
3. If converged, then stop. Otherwise, increment h and return to Step 1. 



C.l. The M step. The gating network parameters f3 and the Benter 
model parameters (p,a) influence the Q function (C.l) through distinct 
terms. Hence, the M step reduces to separate maximization problems for 
each parameter set. Moreover, an ECM algorithm is implemented where 
the M step consists of a series of conditional maximization steps. Here, the 
conditional maximizations are with respect to p^,. . . 02, . • . ,aN-i and 

^2' • • • 'i^x' 

The conditional maximizations are difficult in practice and are therefore 
implemented using the MM algorithm. This algorithm works by first con- 
structing a surrogate function which minorizes the objective Q function and 
then maximizing the minorizing surrogate function. This process is iterated 
leading to a sequence of parameters estimates giving increasing values of the 
objective Q function. 

To construct surrogate functions, mathematical properties of the objec- 
tive function, or of terms within it, are exploited. One such property is the 
supporting hyperplane property (SHP) of a convex function. If f{9) is a 
convex function with differential f'{0), then the SHP states that 

(C.2) f{9) > fie'^^^) + f'{e^'^'^){e - e'-''^). 

The SHP provides a linear minorizing function which is an ideal candidate 
for a surrogate function in an optimization transfer algorithm. 

Sometimes it may be preferable to form a quadratic or higher order surro- 
gate function. For example, if f{9) is a concave function bounding it around 
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, using a quadratic gives 

/(0)>/(0W) + [/'(^W)f(0-0W) 

(C.3) 

where B is a negative definite matrix such that H{6^^^) > B and H{9^^^) is 
the Hessian cff/d{e^^^f. 

Both these tools are used within the EMM algorithm for rank data as 
detailed below: 

1. Maximization with respect to the Benter support parameters. When con- 
ditionally maximizing with respect to pkj, the dampening parameters 
are treated as fixed constants, a, equal to the estimates from the previ- 
ous iteration. Within the Q function (C.l), the term — logX]fLtP^!c(j s) 
problematic in terms of optimization with respect to pkj- However, since 
the — \og{9) function is a strictly convex function, a linear minorizing 
surrogate function may be obtained via the SHP (C.2), that is, 

TV TV jfit 

-iogE^'fc(..)>-iogEC(..) + i ""'^^ 



s=t s=t 



y-iv -at 



where pkj is a constant and, in practice, is the estimate of pkj from the 
previous iteration. Substituting the nonconstant terms into the objective 
function, it follows that, up to a constant, 



i=i k=i t=i 



<3 > E E E ^ifc iogpfcc{i,t) - ( 

^l^s=tPkc{i,s) 



which still poses maximization problems. However, implementing the 
SHP (C.2) of the convex function f[p) = — p"*, 

_pat > _pat _ atp^t-i^p _ 
again provides the surrogate function 

at logPfcc(i,t) - \ EC(^,.) f \ E (^tPZl^,)Pkc(i,s) 



M K rii 

i=lk=lt=l 



^s=t J \s=t 



up to a constant. Iterative maximization of the surrogate function pro- 
duces a sequence of pkj values which converge to a maximum of Q. 
Straight forward maximization provides 

. ^^^fcj 

Pkj-^M v-n, ^ (^N -at ._w„Ar+i s^-l . y 
2^i=l 2^t=l^ik\2^s=tPkc{i,s)} \2^s=t "-tPkj "U^J 
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where 

M Ui 

^kj = E E ^*fcatl{j=c(i,s)}> 
i=l t=l 

given that is the usual indicator function and 



JlJS 



1, if j = c{i, s) and 1 < s < 

1, if j ^ c{i, I) for 1 < / < n,/ and s = iV + 1, 

0, otherwise. 



The methodology presented here is similar to that used when a mix- 
ture of Benter models is fitted via the EMM algorithm as detailed in 
Gormley and Murphy (2008a). 
2. Maximization with respect to the Benter dampening parameters. In this 
case the support parameters are treated as constant with p/^j denoting 
the estimate from the previous iteration. Returning to the original objec- 
tive function (C.l), the problematic term — log J2s^tPkc{i s) ^® ^ convex 
function of at, and employing the SHP (C.2) again gives 



M K rii 

i=i fc=it=i 



OitlogPkc(i,t) + 



^s=tPkc{i,s) 
^s=tPkc{i,s) , 



As before, this surrogate function still poses optimization problems. How- 
ever, as f{a) = —p" is a concave function, by (C.3), 

> -p" - (logp)p°(a - a) - l/2(a - af{\ogpf, 

since H{a) > B = — (logp)^. This provides the surrogate function 

M K rii 

Q>EEE^ifc Oitlogpkc{i,t) 

i=lfc=li=l 

\s=t / {s=t 

- l/2{at - at)'^{logpkc{i,s)f) 

up to a constant which is a quadratic in at- Iterative maximization leads 
to a sequence of at estimates which converge to a local maximum of Q. 
Similar methodology is implemented in Gormley and Murphy (2008a) 
and formulae for dt may be found therein. 
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3. Maximization with respect to the gating network parameters. Maximiza- 
tion of (C.l) with respect to the gating network parameters (iki for 
k = 2,. . . ,K and I = 0, . . . , L is also not straight forward. The MM algo- 
rithm for logistic regression is detailed in Hunter and Lange (2004) and 
similar methodology is implemented here to achieve MLEs of the gating 
network parameters. 

The Q function, up to a constant, as a function of P is 



M r K ( K 



(C.4) Q = Y^ 



i=l LA:=1 (.k'=l 

since, by definition, J2f=i ^ik = 1- As (C.4) is a concave function, by (C.3), 
the quadratic function of 

Q(^?)) + Q'iPi'YiP, - + l/2(^, - SYm^ - §!k^) 
minorizes QiPjJ at the point (3^/^^ where B = — l/4:J2iLiWiw[ such that 

Maximizing this minorizing surrogate function gives the iterative up- 
date formula 

(C.5) #+^)=#)-B-iQ'(^f)), 

which only requires the inversion of B once during the iterative algorithm. 
The similarity with the well-known Newton-Raphson update is apparent 
— the MM algorithm update (C.5) trades the computational inefficiency 
of the Newton-Raphson update for an increased number of iterations. 

By embedding these MM algorithm steps in the M step of the EM al- 
gorithm, a sequence of parameter estimates is produced which converges 
to (local) MLEs of the Benter model parameters (p,a) and of the gating 
network parameters (3. 
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SUPPLEMENTARY MATERIAL 

Computing code for "A mixture of experts model for rank data with 
applications in election studies" [Gormley and Murphy (2008c)] 

(DOL 10.1214/08-AOAS178SUPP; .zip). This package contains the data 
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and C programs used to produce the results in this manuscript. The code 
is explained in the file README.txt and is easily modified to fit the model 
to alternative data. 
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